{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chemical similarity using PubChem fingerprints"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pubchempy as pcp\n",
    "from IPython.display import Image"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we'll get some compounds. Here we retrieve them directly by PubChem CID, but you could also search for them by name, SMILES, SDF, etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?cid=323&t=l\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "coumarin = pcp.Compound.from_cid(323)\n",
    "Image(url='https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?cid=323&t=l')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?cid=72653&t=l\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "coumarin_314 = pcp.Compound.from_cid(72653)\n",
    "Image(url='https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?cid=72653&t=l')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?cid=108770&t=l\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "coumarin_343 = pcp.Compound.from_cid(108770)\n",
    "Image(url='https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?cid=108770&t=l')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?cid=2244&t=l\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "aspirin = pcp.Compound.from_cid(2244)\n",
    "Image(url='https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?cid=2244&t=l')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The similarity between two molecules is typically calculated using molecular fingerprints that encode structural information about the molecule as a series of bits (0 or 1). Each bit represents the presence or absence of a particular pattern or substructure. Two molecules that contain more of the same patterns will have more bits in common, indicating that they are more similar.\n",
    "\n",
    "The PubChem CACTVS fingerprint is available on each compound through the `fingerprint` property. It is returned as a hex-encoded string:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "u'0000037180703000000000000000000000000000000000000000304000000000000000810000001A00000000000C04809800300E80000400880220D208000208002020000888000608C80C262284311A823A20A4C01108A98780C0200E00000000000800000000000000100000000000000000'"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "coumarin.fingerprint"
   ]
  },
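  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first eight hex digits (`00000371`) are a length prefix: 0x371 is 881, the number of bits in a PubChem fingerprint. The fingerprint bits follow, padded with zeros at the end. We can decode the whole string from hexadecimal and display it as a binary string as follows:"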
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'0b1101110001100000000111000000110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011000001000000000000000000000000000000000000000000000000000000000000001000000100000000000000000000000000011010000000000000000000000000000000000000000000001100000001001000000010011000000000000011000000001110100000000000000000000100000000001000100000000010001000001101001000001000000000000000001000001000000000000010000000100000000000000000100010001000000000000000011000001000110010000000110000100110001000101000010000110001000110101000001000111010001000001010010011000000000100010000100010101001100001111000000011000000001000000000111000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bin(int(coumarin.fingerprint, 16))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is more information about the PubChem fingerprints at <ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt>.\n",
    "\n",
    "The most commonly used measure for quantifying the similarity of two fingerprints is the Tanimoto coefficient, given by:\n",
    "\n",
    "$$ T = \\frac{N_{ab}}{N_{a} + N_{b} - N_{ab}} $$\n",
    "\n",
    "where $N_{a}$ and $N_{b}$ are the numbers of 1-bits (i.e. bits corresponding to the presence of a pattern) in the fingerprints of molecule $a$ and molecule $b$ respectively, and $N_{ab}$ is the number of 1-bits common to both fingerprints. The Tanimoto coefficient ranges from 0, when the fingerprints have no 1-bits in common, to 1, when the fingerprints are identical.\n",
    "\n",
    "Here's a simple way to calculate the Tanimoto coefficient between two compounds in Python. Note that it operates on the raw hex string, including its leading 4-byte length prefix (`00000371`); because these prefix bits are identical in every fingerprint, counting them slightly inflates scores below 1:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
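   },
   "outputs": [],
   "source": [
    "def tanimoto(compound1, compound2):\n",
    "    \"\"\"Return the Tanimoto coefficient of two compounds' PubChem fingerprints.\"\"\"\n",
    "    # Decode the hex-encoded fingerprints into integers so we can use bitwise operations\n",
    "    fp1 = int(compound1.fingerprint, 16)\n",
    "    fp2 = int(compound2.fingerprint, 16)\n",
    "    # Count the 1-bits in each fingerprint, and the 1-bits they have in common\n",
    "    fp1_count = bin(fp1).count('1')\n",
    "    fp2_count = bin(fp2).count('1')\n",
    "    both_count = bin(fp1 & fp2).count('1')\n",
    "    return float(both_count) / (fp1_count + fp2_count - both_count)"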
   ]
  },
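  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the formula concrete, here is a small worked example using two made-up 8-bit fingerprints (illustrative values only, not real PubChem fingerprints). Suppose molecule $a$ has the fingerprint `11010010` and molecule $b$ has `10110010`. Then $N_{a} = 4$ and $N_{b} = 4$, and the two fingerprints share 1-bits at the first, fourth and seventh positions, so $N_{ab} = 3$:\n",
    "\n",
    "$$ T = \\frac{N_{ab}}{N_{a} + N_{b} - N_{ab}} = \\frac{3}{4 + 4 - 3} = \\frac{3}{5} = 0.6 $$"
   ]
  },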
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try it out:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tanimoto(coumarin, coumarin)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.6011904761904762"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tanimoto(coumarin, coumarin_314)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.6011904761904762"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tanimoto(coumarin, coumarin_343)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9529411764705882"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tanimoto(coumarin_314, coumarin_343)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8211382113821138"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tanimoto(coumarin, aspirin)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.6123595505617978"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tanimoto(coumarin_343, aspirin)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This method is nice and simple, but not particularly efficient, since it builds a binary string for every bit count. If you are looking for better performance, check out Andrew Dalke's work:\n",
    "\n",
    "- [Computing Tanimoto scores, quickly](http://www.dalkescientific.com/writings/diary/archive/2008/06/27/computing_tanimoto_scores.html)\n",
    "- [chemfp](http://chemfp.com)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
